In this section, we first discuss the STT-RAM preliminaries. This is followed by a detailed discussion of
STT-RAM models which we have developed to guide us in our exploration of suitable STT-RAM device for
LLC. Based on this, we show three stable/feasible STT-RAM cell designs
suitable for LLC in multi-core architectures.
%\begin{wrapfigure}{l}{0.50\textwidth}
%%\begin{figure*} [t]
%\centering
% \psfig{figure=figures/stt-cell.eps, width=1.6in, height=0.4\textwidth, angle=-90}
% \caption{\label{fig:mram_cell} (a) Structural view of an STT-RAM Cache Cell (b) Anti Space Parallel High Resistance, Indicating ``0" state (c) Parallel Low Resistance, Indicating ``1" state}
%%\end{figure*}
%\end{wrapfigure}
\subsection{Preliminary on STT-RAM}
\begin{figure*} [t]
\centering
\begin{minipage}{0.575\textwidth}
\centering
 \psfig{figure=figures/stt-cell.eps, width=1.5in, height=0.85\textwidth, angle=-90}
 \caption{\label{fig:mram_cell} (a) Structural view of an STT-RAM Cache Cell
 (b) Anti-Parallel High Resistance, Indicating ``0" state (c) Parallel Low Resistance, Indicating ``1" state}
\end{minipage}
\hfill
\begin{minipage}{0.375\textwidth}
\centering
 \psfig{figure=figures/IcWt.eps, width=0.85\textwidth, height=1.5in}
 \caption{\label{fig:IcWt} Demonstration of three switching phases:
 thermal activation, dynamic reversal and precessional switching }
\end{minipage}
\end{figure*}

STT-RAM uses Magnetic Tunnel Junction(MTJ) as the memory storage and leverages the difference in
magnetic directions to represent a memory bit (``0"/``1" state). As shown in
Figure~\ref{fig:mram_cell}, an MTJ contains two ferromagnetic layers. One ferromagnetic layer has a
fixed magnetization direction and called the reference layer. The second layer's magnetic
direction can be changed by passing write current, and, thus it is called the free layer. The
relative magnetization direction of two ferromagnetic layers determines the resistance of the MTJ.  If
two ferromagnetic layers have different directions, the resistance of MTJ is high, indicating a ``0"
state; if two layers have the same directions, the resistance of MTJ is low, indicating a ``1"
state. The current amplitude required to reverse the
direction of the free ferromagnetic layer is determined by the size, the aspect ratio of MTJ, and
the write pulse duration~\cite{STTRAM:JAP07, STTRAM:Qualcomm09}.

\subsection{Write current Vs. Write pulse width trade-off} \label{subsec:ict}

%\begin{wrapfigure}{l}{0.40\textwidth}
%\centering
% \psfig{figure=figures/IcWt.eps, width=0.40\textwidth, height=1.45in}
% \caption{\label{fig:IcWt} Demonstration of three switching phases: thermal activation, dynamic reversal and precessional switching }
%\end{wrapfigure}

The current amplitude required to reverse the direction of the free ferromagnetic layer is determined
by many factors like material property, device geometry and most importantly, the write pulse
duration. Generally, when longer write pulse is applied, lesser switching current is sufficient to 
change the MTJ state. In~\cite{STTRAM:JAP07}, three distinct switching modes were identified 
based on the operating range of switching pulse width $\tau$: thermal activation ($\tau>20ns$),
precessional switching ($\tau<3ns$) and dynamic reversal ($3ns<\tau<20ns$). The relationship between switching current density ($J_{c}$), and write pulse width ($\tau$) in these
three operating ranges are characterized by the following analytical model described in~\cite{STTRAM:IEDM09}:

%\begin{eqnarray}
 {
 \small{
\noindent $J_{c,TA}(\tau) = J_{c0}\{1- (\frac{k_{B}T}{E_{b}})ln(\frac{\tau}{\tau_{0}})\}$ 
\hspace{1mm} \textbf{(1)} \hspace{1mm} $J_{c,PS}(\tau) = J_{c0}+ \frac{C}{\tau^{\gamma}}$ 
\hspace{1mm} \textbf{(2)} \hspace{1mm} $J_{c,DR}(\tau) =
\frac{J_{c,TA}(\tau)+J_{c,PS}(\tau)e^{-k(\tau - \tau_{c})}}{1+e^{-k(\tau - \tau_{c})}}$ \hspace{1mm}
\textbf{(3)}
 }
 }
%\end{eqnarray}

where, $J_{c,TA}$, $J_{c,PS}$, $J_{c,DR}$ are the switching current densities for thermal activation,
precessional switching and dynamic reversal respectively. $J_{c0}$ is the critical switching current
density, $k_{B}$ is the Boltzmann constant, $T$ is the temperature, $E_{b}$ is the thermal barrier,
and $\tau_{0}$ is the inverse of attempt frequency. $C$, $\gamma$, $k$, and $\tau_{c}$ are fitting
constants. Based on Figure~\ref{fig:IcWt} and analysis of the analytical model, we found very
different switching characteristics in the three switching modes. For example, in the thermal
activation mode, the required switching current increases very slowly even if we decrease the write
pulse width, and thus, short write pulse width is more favorable in this
region since reducing write pulse can reduce both the write latency and the energy without much penalty on
the read latency and energy. In the precessional switching mode, write current goes up rapidly as we
reduce the write pulse width, therefore, minimum write energy of the MTJ is achieved at some
particular write pulse width in this region. Based on this analysis, this paper will focus on the
exploration of write pulse width in precessional switching and dynamic reversal to optimize for
different design goals.

\subsection{STT-RAM Modeling}

Before simulating the performance characteristics of the STT-RAM cache, we first introduce models to
determine the area of a STT-RAM cell. Area of each STT-RAM cell would determine the area of a cache
bank composed of these cells and in turn influence the read/write latency of the bank. As shown
earlier in Figure~\ref{fig:mram_cell}, each 1T1J STT-RAM cell is composed of an NMOS and one MTJ. The
NMOS access device is connected in series with the MTJ. The size of NMOS is constrained by both SET
and RESET current, which are inversely proportional to the writing pulse width. In order to estimate
the current driving ability of MOSFET devices, a small test circuit using HSPICE with PTM 45nm HP
model~\cite{PTM} is simulated. The driving current is obtained by assuming typical TMR (120\%) and
Low Resistance State (LRS) ($3k\Omega$) value~\cite{STTRAM:Qualcomm09} and wordline voltage to be 1.5V (the optimal value is
extracted from~\cite{STTRAM:Gatech10}). Further, we oversize the access transistor width to guarantee
enough write current is provided to MTJ using the methodology discussed in~\cite{STTRAM:RPI10}. To
achieve high cell density, we model the STT-RAM cell area by referring to the DRAM design
rules~\cite{DRAM:6F2}.  As a result, the cell size of a STT-RAM cell is given as:

%\begin{equation}
 {
 \small{
 \hspace{55mm} $\mathrm{Area}_{\mathrm{cell}}={3\left(W/L+1\right)}(F^2)$ \hspace{3mm} \textbf{(4)}
 }
 }
%\end{equation}

where, $W$ and $L$ are the channel width and length of the access NMOS transistor, respectively.

\subsection{Impact of Retention Time on MTJ Characteristics} \label{subsec:retention}



The retention time of a MTJ is largely determined by the \textit{thermal stability} of the MTJ. The relationship
between retention time and thermal barrier is shown in Figure~\ref{fig:retention}, which can be
modeled as $t=C\times e^{k\Delta}$, where $t$ is the retention time and $\Delta$ is the thermal
barrier, while $C$ and $k$ are fitting constants. Thermal stability of the free layer in an MTJ has impacts not
only on the retention time of STT-RAM memory cell, but also on the write current. It was found
in~\cite{STTRAM:Grandis11} that the switching current of MTJ decreases as thermal barrier is reduced. Here, we combine this observation with the write current versus write time trade-off described in Section~\ref{subsec:ict}, which essentially means that once the thermal barrier of a MTJ is lowered, we are able to achieve faster write speed or/and smaller write current/energy. The baseline MTJ in our study is a $2F^2$ in-plane MTJ with a thermal barrier of $72k_{B}T$, where $k_{B}$ is the Boltzman constant and $T$ is the temperature. The non-volatile MTJ can hold data for more than 10 years under a worst-case temperature of $125\,^{\circ}\mathrm{C}$. Unlike~\cite{STTRAM:HPCA11}, the optimized $2F^2$ cell does not have much room for reducing the planar area. Hence, reduction in thermal barrier was obtained by decreasing the thickness of the free layer and lowering the saturation magnetization. For example, we are able to get two volatile MTJs with reduced thermal barriers of $46k_{B}T$ and $40k_{B}T$, which corresponds to retention times of $1sec$ and $10ms$ under $125\,^{\circ}\mathrm{C}$, respectively. Raw experimental data is obtained from our device collaborator Qualcomm's STT-RAM group and we further did curve fitting by using the in-plane MTJ device equations (1)-(3). The results are plotted in Figure~\ref{fig:currentVStime}. Next we explain how the volatile MTJs are optimized for both latency and energy. The $10ns$ write pulse width is used as the practical operating point A($10ns$, $114\mu A$) in our baseline MTJ similar to~\cite{CACTI:DAC08:Dong}. Simply by fixing the write pulse width at $10ns$, we are able to lower the write currents of the two volatile MTJs which operate at B($10ns$, $73\mu A$) and C($10ns$, $40\mu A$). Then we apply the write current versus write time trade-off on the operating points of the volatile MTJs to reduce the write latency. Particularly, we operate the MTJ with 1sec-retention at point B'($5ns$, $82\mu A$) and the write current is 20\% lower than that of the baseline A. Moreover, we operate the MTJ with 10ms-retention at C'($2ns$, $61\mu A$) and the write current is 20\% lower than that of B'. The figure depicts that the write current and write pulse width for $10ms$ retention time (point C') is lower than that of $1sec$ retention time(point B').

%This challenge is further compounded by device variations, where data retention times follows a
%distribution.

\subsection{STT-RAM Cache Simulation Setup}

\begin{figure*} [t]
\centering
\begin{minipage}{0.575\textwidth}
\centering
 \psfig{figure=figures/RT_TB.eps, width=0.65\textwidth, height=1.7in}
 \caption{\label{fig:retention} MTJ thermal stability requirement for different retention times}
\end{minipage}
\hfill
\begin{minipage}{0.375\textwidth}
\centering
 \psfig{figure=figures/i_vs_t.eps, width=0.85\textwidth, height=1.7in}
 \caption{\label{fig:currentVStime} Write current versus write pulse width for three MTJs with $10years$, $1sec$, and $10ms$ retention at $125\,^{\circ}\mathrm{C}$ }
\end{minipage}
\end{figure*}


% \begin{figure*} [t]
% \centering
% \centering
 % \psfig{figure=figures/i_vs_t.eps, width=3.4in, height=1.9in}
 % \caption{\label{fig:currentVStime} Write current versus write pulse width for three MTJs with $10years$, $1sec$, and $10ms$ retention at $125\,^{\circ}\mathrm{C}$}
% \end{figure*}

\begin{table*}[t]
 \scriptsize
  \centering
  \caption{16-way L2 Cache Simulation Results}
  \label{allcaches}
  \begin{tabular}{| c | c | c | c | c | c | c | c |}
    \hline\hline
    \multirow{2}{*}{} & & Area  & Read Latency & Write Latency & Read Energy & Write Energy & Leakage Power\\
  & & ($mm^2$) & ($ns$) & ($ns$) & ($nJ$) & ($nJ$) & ($mW$) \\
    \hline\hline
    \multicolumn{2}{|c|}{$1MB$ SRAM} & $2.612$ & $1.012$ & $1.012$ & $0.578$ & $0.578$ & $4542$ \\
    \hline
    \multirow{3}{*}{$4MB$ STT-RAM} & $t=10yr$ & $3.003$ & $0.998$ & $10.61$ & $1.035$ & $1.066$ & $2524$ \\
%   \cline{2-2}\cline{3-8}
    & {$t=1s$} & $2.904$ & $0.973$ & $5.571$ & $1.015$ & $1.036$ & $2235$ \\
%   \cline{2-2}\cline{3-8}
    & {$t=10ms$} & $2.901$ & $0.959$ & $2.598$ & $1.002$ & $1.028$ & $2227$ \\
    \hline\hline
  \end{tabular}
\end{table*}

Based on the above device level characterizations and analytical models, we simulate SRAM-based
caches and STT-RAM-based caches with a tool called NVsim~\cite{CACTI:PCRAMsim}. NVsim is a
circuit-level performance, energy, and area simulator based on CACTI for emerging non-volatile
memories. We integrated all the models described in the previous sub-sections in NVsim.
Table~\ref{allcaches} shows the architectural parameters based on our STT-RAM models. It shows the
read, write times and energy numbers of three stable operating points A, B', and C' for MTJs with different retention times. We find that a 4MB NVM STT-RAM cache occupies similar chip area as 1MB SRAM. This is consistent with previous work~\cite{CACTI:DAC08:Dong}. For the leakage simulation, we didn't apply any power gating techniques for the cache banks and hence our leakage numbers are higher than the previously published numbers in recent STT-RAM papers. The results show that STT-RAM consume almost half the leakage power of SRAM. That is basically because half of STT-RAM die area is occupied by peripheral circuitry which means half of the chip is leaky.

By relaxing retention time of STT-RAM with lower thermal barrier, the STT-RAM cache can have smaller
area, lower write latency and less leaky peripheral circuitry. However, since retention time is
exponentially related with thermal barrier, and thermal barrier is highly sensitive to process
variation and temperature, the benefit of decreasing write latency by relaxing the retention time in
the same order of magnitude is so small, that it can be easily offset by slight variation in device
geometry or environmental temperature. For example, the difference between thermal barrier for $10ms$ and
$50ms$ retention MTJ is only about 5\%.
%The simulation results are listed in Table~\ref{allcaches}
%which shows the read, write times and energy numbers for three different retention times. The SRAM
%numbers are also given in the table for reference.
Based on this analysis, we propose $10years$, $1sec$ and $10ms$ STT-RAM devices for our
LLC design. Also, we understand that there are many alternatives of MTJ structure designs. Since, the
characteristics of most of these MTJs sit between pure in-plane MTJ and perpendicular MTJ, we tried
to illustrate our architecture-level solution for feasible/fabricated in-plane MTJ. Our solution 
can easily be tweaked for any other device. In later sections, we will analyze these designs and will propose that $10ms$ is ideal both from device as well as application side.
